CAGEF_services_slide.png

An Introduction to Data Visualization in a Pandemic World

0.1.0 An overview of Advanced Graphics and Data Visualization in R

"An Introduction to Data Visualization in a Pandemic World" is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This talk was developed to introduce participants of the Bioinformatics and Computational Biology Student Union (BCBSU) Biohacks 2021 conference to the world of R by focusing on basic concepts, methods, and packages for formatting and plotting scientific data. While the datasets and examples used in this talk are centred on SARS-CoV-2 epidemiological data, the lessons learned herein can be applied broadly.

The aim is that, by the end of this presentation, students will know how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) in your data.

The structure of the presentation is a code-along style using Jupyter notebooks. At the start of this presentation, a skeleton version will be provided for use on the University of Toronto Jupyter Hub so students can program along with the presenter.

To reproduce the repository from GitHub on your Jupyter Hub, simply click on this link.

0.2.0 Lecture objectives

This will be your 1-hour crash course on Jupyter notebooks and R! At the end of this lecture we will have covered the following topics:

  1. Working with Jupyter notebooks
  2. R data types, objects and working with them
  3. Long-format and tidy data principles using tidyverse
  4. Graphical analysis of data

0.3.0 A legend for text format in Jupyter markdown

grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

0.4.0 Data used in this presentation

Today's datasets will focus on epidemiological data from the Ontario provincial government found here.

0.4.1 Dataset 1: Ontario_daily_change_in_cases_by_phu.csv

This dataset was obtained from the Ontario provincial website and holds statistics regarding SARS-CoV-2 cases throughout different public health units in the province. It is in a comma-separated format and has been collected since 2020-03-24.

0.4.2 Dataset 2: Ontario_covidtesting.csv

This dataset was obtained from the Ontario provincial website and holds statistics regarding SARS-CoV-2 throughout the province. It is in a comma-separated format and has been growing since initial tracking started on 2020-01-26.


1.0.0 Coding in Jupyter Notebooks

If you'd like to code along with this presentation, please begin by clicking on the following link, which will clone a GitHub repository to your personal Jupyter Hub with the University of Toronto.

Your work with the Jupyter Notebook on the University of Toronto JupyterHub will all be contained within a new browser tab, with the address bar showing something like

https://jupyter.utoronto.ca/user/assigned-username-hexadecimal/tree/BCBSU_Biohacks_2021

All of this is running remotely on a University of Toronto server rather than your own machine.

You'll see a directory structure from your home folder:

i.e. /BCBSU_Biohacks_2021/. Clicking on that folder, you'll find Intro_R_dataViz.skeleton.ipynb, which is the notebook we will use for today's code-along talk.


1.1.0 Why are we using Jupyter Notebooks?

This presentation has been implemented on this platform to reduce the burden of having to install various programs. While installation can be a little tricky, it's really not that bad. For this introductory talk, however, you don't need to go through all of that just to learn the basics of coding.

Jupyter Notebooks also give us the option of inserting "markdown" text much like what you're reading at this very moment. So we can intersperse ideas and information between our learning code blocks.

There is, however, an appendix section at the end of this lecture detailing how to install Jupyter Notebooks (and the R kernel for them), as well as how to install R independently along with a great integrated development environment (IDE) called RStudio.


1.2.0 Jupyter notebooks run programming language kernels like R

Behind the scenes of each Jupyter notebook a programming kernel is running. For instance, depending on the setup, our notebooks can run a true or "emulated" R-kernel to interpret each code cell as if it were written specifically for the R language.

As we move from code cell to code cell, all of the variables or objects we have created are stored in memory. We can refer to these as we run the code and move forward, but if you overwrite or change them by mistake, you may have to rerun multiple cell blocks!

There are some options in the "Cell" menu that can alleviate these problems such as "Run All Above". If you think you've made a big error by overwriting a key object, you can use that option to "re-initialize" all of your previous code!

Remember these friendly keys/shortcuts:

In Command mode


1.2.1 Why would you want to use a Jupyter Notebook?

Depending on your needs, you may find yourself doing the following:

Jupyter allows you to alternate between "markdown" notes and "code" that can be run or re-run on the fly.

Each data run and its results can be saved individually as a new notebook or as new cells to compare data and small changes across your analyses!


1.3.0 Packages contain useful functions that we'll use often

So... what is in these packages? A package can be a collection of

Functions are the basic workhorses of R; they are the tools we use to analyze our data. Each function can be thought of as a unit that has a specific task. A function takes an input, evaluates it using an expression (e.g. a calculation, plot, merge, etc.), and returns an output (a single value, multiple values, a graphic, etc.).

In this course we will rely a lot on a package called tidyverse which is also dependent upon a series of other packages.

1.3.1 Packages used in this presentation

repr - a package useful for altering some of the attributes of objects related to the R kernel.

tidyverse, which includes a number of packages such as dplyr, tidyr, stringr, forcats and ggplot2

viridis helps to create colour-blind friendly palettes for our data visualizations

lubridate and zoo are helper packages used for working with date formats in R

Let's run our first code cell!


2.0.0 Foundations of R

There are many tips and tricks to remember about R, but here we'll quickly recall some foundational knowledge that we'll need further into this lesson.


2.1.0 Assigning variables

If we want to hold onto a number, calculation, or object we need to assign it to a named variable. R has multiple methods for assigning a value to a variable and an order of precedence!

-> Rightward assignment: we won't really be using this in our course.

<- Leftward assignment: assignment used by most 'authentic' R programmers but really just a historical throwback.

= Leftward assignment: commonly used token for assignment in many other programming languages but be careful as it carries dual meaning!
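A quick sketch of the three tokens in action (the values here are arbitrary):

```r
x <- 5        # leftward assignment: the conventional R style
5 -> y        # rightward assignment: rarely seen in practice
z = 5         # '=' assigns at the top level, but inside a function call
              # it names an argument instead, e.g. mean(x = c(1, 2, 3))
c(x, y, z)    # all three variables now hold 5
```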

Notes


2.2.0 Data types

What do I mean by 'types' of data?


2.2.1 Data structures

The job of data structures is to "host" the different data types. There are five types of data structures in R:

  1. vectors - 1D - holds one type of data
  2. lists - 1D - holds multiple data types
  3. matrices - 2D - holds one type of data
  4. data frames - 2D - holds multiple data types
  5. arrays - nD - holds one type of data (nD = more than two dimensions)

data_structures.jpg
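A minimal sketch of each structure (the values are arbitrary):

```r
v  <- c(1, 2, 3)                                   # vector: 1D, one type
l  <- list(1, "a", TRUE)                           # list: 1D, mixed types
m  <- matrix(1:6, nrow = 2)                        # matrix: 2D, one type
df <- data.frame(n = 1:3, s = c("a", "b", "c"))    # data frame: 2D, mixed types
a  <- array(1:24, dim = c(2, 3, 4))                # array: 3D here, one type
```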


2.2.2 Vectors are like a queue of a single data type


2.2.2.1 Coercion changes data from one type to another (where applicable)

R will implicitly force (coerce) your vector to be of one data type; the type chosen is the most inclusive one, which in this case is character. When we explicitly coerce a change from one data type to another, it is known as casting. You can cast between certain data types and also object types.

Importantly, when coercing, the R kernel converts from more specific to more general types, usually in this order:


logical $\rightarrow$ integer $\rightarrow$ numeric $\rightarrow$ complex $\rightarrow$ character $\rightarrow$ list.
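A sketch of implicit coercion versus explicit casting:

```r
mixed <- c(1, TRUE, "two")       # implicit coercion: the most inclusive type wins
class(mixed)                     # "character"

as.numeric(c("1", "2", "3"))     # explicit casting with the as.*() family
as.integer(TRUE)                 # logical -> integer: 1
as.character(3.14)               # numeric -> character: "3.14"
```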

2.2.3 Data Frames

2.2.3.1 Object classes

Now that we have had the opportunity to create a few different vector objects, let's talk about what an object class is. An object class can be thought of as a structure with attributes that will behave a certain way when passed to a function. Because of this, the same function can respond differently depending on the class of the object it receives.

Some R package developers have created their own object classes. For example, many of the functions in the tidyverse generate tibble objects. They behave in most ways like a data frame but have a more refined print structure, making it easier to see information such as column types when viewing them quickly. In general, from a trouble-shooting standpoint, it is good to be aware that your data may need to be formatted to fit a certain class of object when using different packages.


2.2.3.2 Data frames are groups of vectors aligned as columns

Whereas matrices are 2-dimensional structures limited to a single data type, data frames are more complex: each column of the structure can be treated like a vector, and the data frame as a whole can mix data types across its different columns. Data frame rules to remember are:

  1. Within a column, all members must be of the same data type (i.e. character, numeric, factor, etc.)
  2. All columns must have the same number of rows (hence the matrix shape)

Data frames allow us to generate tables of mixed information much like an Excel spreadsheet.


2.2.3.3 Some useful data frame commands (for now)

nrow(data_frame) # retrieve the number of rows in a data frame

ncol(data_frame) # retrieve the number of columns in a data frame

data_frame$column_name # Access a specific column by its name

data_frame[x,y] # Access a specific element located at row x, column y

rownames(data_frame) # retrieve or assign row names to your data frame

colnames(data_frame) # retrieve or assign column names to your data frame

There are many more ways to access and manipulate data frames that we'll explore further down the road. Let's review some basic data frame code.
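The commands above, run against a small hypothetical data frame:

```r
df <- data.frame(phu = c("Toronto", "Ottawa"), cases = c(100, 40))

nrow(df)          # 2
ncol(df)          # 2
df$cases          # 100 40
df[1, 2]          # 100 (row 1, column 2)
colnames(df)      # "phu" "cases"
colnames(df) <- c("region", "new_cases")   # assign new column names
```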


2.3.0 Factors codify your data into categorical variables

Ah, the dreaded factors! A factor is a class of object used to encode a character vector into categories. They are used to store categorical variables, and although it is tempting to think of them as character vectors, this is a dangerous mistake.

Factors make perfect sense if you are a statistician designing a programming language (!) but to everyone else they exist solely to torment us with confusing errors. A factor is really just an integer vector with an additional attribute, levels, which defines the possible values as character strings.

2.3.0.1 Why use factors?

Why not just use character vectors, you ask?

Believe it or not factors do have some useful properties. For example, factors allow you to specify all possible values a variable may take even if those values are not in your data set. Think of conditional formatting in Excel. We also use them heavily in generating statistical analyses and in grouping data when we want to visualize it.
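For instance, a factor can carry a level that never appears in the data (a hypothetical severity variable here):

```r
severity <- factor(c("mild", "mild", "severe"),
                   levels = c("mild", "moderate", "severe"))

levels(severity)   # "mild" "moderate" "severe" -- "moderate" kept despite no data
table(severity)    # counts report a zero for "moderate" rather than dropping it
```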

For more information about factors, check out the appendix!


2.6.0 Special data: NA and NaN values

Missing values in R are represented as NA (Not Available). Impossible values (like the result of dividing zero by zero) are represented by NaN (Not a Number). Both can be considered null values. These values, especially NAs, require special handling; otherwise they may lead to errors in your functions.

For our purposes, we are not interested in keeping NA data within our datasets so we will usually detect and remove them or replace them within our data after it is imported.

2.6.1 Helpful functions and information for dealing with NA data

  1. is.na() returns a logical vector reporting which values from your query are NA.
  2. complete.cases() returns a logical vector identifying rows without any NA values.
  3. Some functions can ignore NA values with the na.rm = TRUE parameter: ie mean(), sum() etc.
  4. Additional functions in the tidyr package can also be used to work with NA values.
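A sketch of these helpers in action (the values are arbitrary):

```r
x <- c(1, NA, 3)

is.na(x)                 # FALSE TRUE FALSE
mean(x)                  # NA -- the NA propagates through the calculation
mean(x, na.rm = TRUE)    # 2

df <- data.frame(a = c(1, NA), b = c(2, 3))
complete.cases(df)       # TRUE FALSE -- the second row contains an NA
```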

3.0.0 Welcome to the tidyverse

Let's begin with some definitions:

latrines_wide_to_long.png

In data science, long format is preferred over wide format because it allows for easier and more efficient subsetting and manipulation of the data. To read more about wide and long formats, visit here.

Why tidy data?

Data cleaning (or dealing with 'messy' data) accounts for a huge chunk of a data scientist's time. Ultimately, we want to get our data into a 'tidy' (long) format where it is easy to manipulate, model and visualize. Having a consistent data structure, and tools that work with that data structure, can help this process along.

Tidy data has:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Every cell is a single value.

This seems pretty straightforward, and it is. It is the datasets you get that will not be straightforward. Having a map of where to take your data is helpful for unraveling its structure and getting it into a usable format.

3.0.1 The five most common problems with messy datasets are:


3.1.0 Opening and saving files with the readr package - "All roads lead to Rome.."

... but not all roads are easy to travel.

Depending on format, data files can be opened in a number of ways. The simplest methods we will use involve the readr package as part of the tidyverse. These functions have already been developed to simplify the import process for users. The functions we will use most often are:

Remember: to learn more about a function you can type its name prefixed with a question mark, i.e. ?read_csv, and it will bring up a help page on that function!

Let's read in a dataset that we can convert from wide to long format.
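As a minimal sketch of reading and writing with readr, using a throwaway temporary file rather than the talk's actual dataset:

```r
library(readr)   # part of the tidyverse

tmp <- tempfile(fileext = ".csv")                               # a throwaway path
write_csv(data.frame(Date = "2020-03-24", new_cases = 10), tmp) # save a frame

demo.df <- read_csv(tmp)   # column types are guessed and reported on import
demo.df
```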


3.1.1 A quick look at our SARS-CoV-2 public health unit data

From looking at our public health unit data, we can see that it begins tracking on 2020-03-24 and goes up until 2021-03-12. In total there are observations for 354 days across 34 public health units. The final column appears to be a running tally of total cases across all PHUs reported on that date.

From the outset, we can see there are some issues with the data set that we'll want to resolve and we'll work through some tidyverse functions in order to do that. First let's quickly review some of the potential problems with our dataset.

  1. There are 34 public health units and a total count for each date. It is preferable for data visualization to collapse all of those public health units into a single variable so that we have a single new_cases value for each Date observation. At the same time we will not collapse Total into that same variable.
  2. The data is rife with NA values. Many instances are likely due to no data being collected on those dates. For our purposes, it may be simpler to replace them with a value of 0.
  3. Our public health unit names are clunky. We should trim them down to simpler region names.

Before we tackle these issues, let's go ahead and review some of the tools at our disposal.


3.2.0 The tidyverse package and its contents make manipulating data easier

While the tidyverse is composed of multiple packages, we will be focused on working with a subset of these: dplyr, tidyr, and stringr.

To save on memory and to help make our code more concise, we should also discuss the use of the %>% symbol. This is a redirection or pipe symbol, similar to the | in Unix operating systems, used for redirecting output from one function to the input of another. By thoughtfully combining this with other commands, we can alter or query our datasets with ease.

Note that most of the time we will be piping into the first parameter of a function. We can explicitly pass the redirected data to a different argument position by using the period placeholder, i.e. . (a single period).
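A quick sketch of the pipe in action (the values are arbitrary):

```r
library(dplyr)   # attaches %>%

# Without the pipe: nested calls read inside-out
round(sqrt(sum(c(1, 2, 3))), 1)

# With the pipe: each result feeds the next function's first argument
c(1, 2, 3) %>% sum() %>% sqrt() %>% round(1)   # 2.4

# '.' redirects the piped value to a different argument position
10 %>% seq(2, ., by = 2)   # 2 4 6 8 10
```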

dplyr has functions for accessing and altering your data

tidyr has additional functions for reshaping our data

stringr provides functionality for searching data based on regular expressions


3.2.1 Reformat a wide table with pivot_longer()

Previously you may have used gather() from the tidyr package to melt wide data into a long format. Today we will use an actively developed successor of this function called pivot_longer() which, for our purposes, relies on three parameters:

  1. data: the data frame (and columns) that we wish to transform.
  2. names_to: the variable name of the new column to hold the collapsed information from our current columns.
  3. values_to: The variable name of the values for each observation that we are collapsing down.

We'll be using a series of %>% so for now we won't save our work to a new object.
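A sketch of the call on a tiny hypothetical wide table (the column names here are stand-ins, not the real dataset's):

```r
library(dplyr)
library(tidyr)

# A tiny hypothetical wide table standing in for the PHU dataset
wide <- data.frame(
  Date    = c("2020-03-24", "2020-03-25"),
  Toronto = c(10, 12),
  Ottawa  = c(3, 5)
)

wide %>%
  pivot_longer(
    cols      = -Date,                 # every column except Date
    names_to  = "Public_Health_Unit",  # old column names become this variable
    values_to = "new_cases"            # old cell values become this variable
  )
# 4 rows: one observation per Date x Public_Health_Unit combination
```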


3.2.2 Replace NA values from our data with replace_na()

Our conversion to long format creates 11,764 observations relating a Date to a new_cases value in a specific Public_Health_Unit (or Total). From the looks of our data, however, we have a number of NA values under our new_cases variable. Let's replace them with a value of 0 using replace_na(). This function needs two parameters:

  1. data: the data frame or vector that it will scan for NA values.
  2. replace: the value that we will use to replace NA.
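A minimal sketch (the values are arbitrary):

```r
library(tidyr)

replace_na(c(5, NA, 2, NA), 0)   # 5 0 2 0

# On a data frame, supply a named list: one replacement value per column
df <- data.frame(new_cases = c(1, NA, 3))
replace_na(df, list(new_cases = 0))
```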

3.2.3 Reformat our public health unit names with str_remove_all()

Looking at our PHU names, we can see that there is a lot of redundancy in our names. We see they end in some form of:

We also see the odd , here and there but we'll leave that alone for now.

We have a couple of choices: we can use either str_replace_all() or a specialized version of it, str_remove_all(), which simply replaces a pattern with an empty string. For str_replace_all() we will supply:

  1. string: a single string or vector of strings.
  2. pattern: the pattern we wish to search for in the form of a string or regular expression.
  3. replace: the replacement string we wish to use.
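A sketch with hypothetical PHU names (the suffixes in the real dataset differ):

```r
library(stringr)

phu <- c("Toronto Public Health", "Ottawa Public Health Unit")

# str_remove_all() is str_replace_all() with an empty replacement string
str_remove_all(phu, " Public Health( Unit)?")        # "Toronto" "Ottawa"
str_replace_all(phu, " Public Health( Unit)?", "")   # the equivalent
```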

3.2.4 rename() variables for clarity

Now that we have the basic structure for our data, we want to clean it up just a little bit by renaming our Total column to clarify that it represents total new cases across all PHUs for that date. Why did we keep this column separate? Now we can use this information to generate percentage totals for each PHU if we choose to.

We'll use rename() from dplyr to accomplish the task of renaming our column. There are a number of ways you could accomplish this without using dplyr but the simplicity of it is nice.


3.2.5 Reorder your columns with relocate()

The last cleanup we can accomplish with our data is to move total_phu_new to the last column of our data frame. This is a matter of personal preference but also makes the data easier to read at a glance. The relocate() verb from dplyr accomplishes this with ease since we are not dropping or removing columns. It uses some extra syntax to help accomplish its functions:

  1. .data: the data frame or tibble we want to alter
  2. ...: the columns we wish to move
  3. .before or .after: determines the destination of the columns. Supplying neither will move columns to the left-hand side.

In fact, relocate() can be used to rename a column as well but it will also be moved by default so consider the ramifications of such an action!
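A sketch on a small hypothetical frame (the column names mirror the talk's, but the values are made up):

```r
library(dplyr)

df <- data.frame(Total = c(13, 17),
                 Date = c("2020-03-24", "2020-03-25"),
                 new_cases = c(10, 12))

df %>% relocate(Total, .after = last_col())   # Total moves to the last column
df %>% relocate(Date)                         # no .before/.after: Date moves left
```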


3.3.0 Save your data to a file - "Country roads... save to home!"

At this point we have completed the data wrangling we want to accomplish on this dataset. We've converted it to long format and renamed the PHU entries while removing any NA values that might cause issues. There are a number of ways we could save this data now, either as a text file or in its current form as a data frame in the .RData format.

Let's try some of those methods now.


3.3.0.1 readxl and writexl for working with Excel spreadsheets

Not all of your data may come in a comma- or tab-delimited format. In the case of Excel spreadsheets, there are some packages available that can facilitate the parsing of these more complex files. The readxl package is part of the tidyverse but the writexl package is not. There are other means of writing to an Excel file format but they are dependent on other programs or drivers.

From the readxl package

From the writexl package, which is not a part of the tidyverse but is independent of Java and Excel


4.0.0 Simple graphical analysis of data with ggplot2

We now have some data in a tidy format that we'd like to visualize. We can begin with some initial analyses of the data using the ggplot2 package. It has all of the components we need to help us decide which data we want to focus on or keep. There are a number of ways to visualize our data, and here we will refresh our ggplot skills.

Basic ggplot notes:


4.1.0 Make a line graph of new cases based on each PHU across all dates

We now have a basic plot object initialized but we need to tell it how to display the data associated with it. We'll begin with a simple line graph of all the public health units across all dates within the set.

In order to update or add layers to a ggplot object, we use the + symbol for each command. For instance, to define the source of x-axis and y-axis data, we use the aes() command to update the aesthetics layer. Remember how we defined the public_health_unit variable as a factor? We'll take advantage of that here and tell ggplot to give each PHU its own colour.

After defining our aesthetics, we still need to tell ggplot how to actually graph the data. The ggplot2 package comes with an abundance of visualizations accessed through the geom_*() commands. Some examples include:
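Putting the layers together as a sketch, with a small hypothetical stand-in for the long-format data (the object and column names mirror the talk's naming, not the real dataset):

```r
library(ggplot2)

# Hypothetical stand-in for the long-format PHU data built earlier in the talk
covid_phu_long.df <- data.frame(
  Date = rep(as.Date("2020-03-24") + 0:2, times = 2),
  Public_Health_Unit = factor(rep(c("Toronto", "Ottawa"), each = 3)),
  new_cases = c(10, 12, 15, 3, 5, 4)
)

p <- ggplot(covid_phu_long.df) +
  aes(x = Date, y = new_cases, colour = Public_Health_Unit) +  # one colour per level
  geom_line()                                                  # draw the data as lines

p   # printing the object renders the graph
```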

4.2.0 Use the facet_wrap() command to break PHUs into separate graphs

There's a lot of data on that graph and some of it is quite drowned out because of the scale of PHUs with many more cases. To break out each PHU individually, we can add the facet_wrap() command. We'll also update some of the parameters:

At the same time, we'll also get rid of the legend since each individual graph will be labeled by its PHU.

4.3.0 Use the ggsave() command to save your plots to a file

There are a number of ways you can use the ggsave() command to specify how you want to save your files.

4.4.0 Barplots can be used to summarize your data across PHUs

Although we do have a running total for each date, what if we want to look at the total cases across subsets of the PHUs? Using a barplot we can stack cases by date and get a sense of daily case totals from whichever sets of PHUs we desire.

This time we will use geom_bar() to display our data and tell it to use the values from our new_cases variable to generate the totals. We do this by setting the stat = "identity" parameter.

At the same time, let's update our colours to use a colour-blind friendly palette scheme.
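A sketch of the stacked barplot with stand-in data (the scale_fill_viridis() call assumes the viridis package loaded at the start of the talk):

```r
library(ggplot2)
library(viridis)

df <- data.frame(   # hypothetical long-format stand-in data
  Date = rep(as.Date("2020-03-24") + 0:1, each = 2),
  Public_Health_Unit = factor(rep(c("Toronto", "Ottawa"), times = 2)),
  new_cases = c(10, 3, 12, 5)
)

p <- ggplot(df, aes(x = Date, y = new_cases, fill = Public_Health_Unit)) +
  geom_bar(stat = "identity") +          # stack the new_cases values themselves
  scale_fill_viridis(discrete = TRUE)    # colour-blind friendly palette

p
```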

4.4.1 Alter your bin widths by transforming your x-axis

From above we get a sense of overall totals for some PHU distributions but it's still too much to look at. Let's transform our x-axis values so we can bin by months instead. To accomplish this we'll use the as.yearmon() function found in the zoo package we loaded at the beginning of the talk.
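A quick sketch of the conversion (the dates are arbitrary):

```r
library(zoo)

d <- as.Date(c("2020-03-24", "2020-03-31", "2020-04-02"))
as.yearmon(d)   # "Mar 2020" "Mar 2020" "Apr 2020" -- dates collapse to months
```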

4.5.0 Filter your data for what you want to display

Now that we have taken an initial look at our data, we can see that even after converting our axis to a month-year format, it appears that some of the data isn't that relevant for us. Some of the PHUs are not generating many new cases per day so we can now consider slicing our data up to look at specific regions.

Let's look at the top 10 regions by total caseload across the dataset.

4.5.1 Use the filter() command to make a subset of our data

Now that we have a list of PHUs ordered by descending total cases, we can use it to filter our covid_phu_long.df data frame and graph only the more heavily affected PHUs. We can then pipe the filtered data over to make a ggplot() object. At the same time we'll do a few more things:

  1. Reorder our factors so that the bars and legend display the PHUs in ascending order by new cases.
  2. Add some additional x-axis and y-axis labels and a title.

4.6.0 Looking at the effect of lockdown on new cases

We can see from our first graph of daily case loads that there can be quite a bit of variability from day to day. Rather than look at the daily tally of new cases, perhaps we can take into account the overall number of new cases appearing in a 14-day sliding window. Given that symptoms can take between 5 and 14 days from the time of infection to manifest, a portion of daily positive cases can be the result of infections going back as far as 14 days. Looking at a 14-day window will also smooth out our data as a line graph.

To accomplish this we'll need to perform some transformations on our dataset.

  1. Ensure our data is grouped by public health unit
  2. Summarise our data in sliding windows of 14-day length

We'll want to track observations by:

4.6.1 Plot our windowed data as a line graph

Now that we've generated our windowed data, let's plot the top 5 PHUs by caseload. Let's also annotate some dates from the 2020 pandemic history:

Here's what we'll do:

  1. Plot the windowed data filtered by the top 5 PHUs
  2. Clean up the graph a little bit by "simplifying" the themes
  3. Annotate 3 dates from the pandemic timeline

5.0.0 Taking your explorations further

Today we've covered just a small example of how to import, format, and visualize data from outside sources. There are, however, a number of visualizations and avenues we haven't explored. Some other popular visualizations to consider

There are more advanced packages that simplify things like

5.1.0 SARS-CoV-2 Data Resources

There are a number of potential COVID-19 data resources out there but here are some comprehensive ones related to Canada and beyond

  1. The province of Ontario makes summaries of its COVID-19 case data available here

  2. A talented graduate student, Jean-Paul R. Soucy cofounded the COVID-19 Canada Open Data Working Group and maintains a GitHub repository of COVID-19 statistics used by the Canadian Federal government. It is updated on a daily basis. His homepage is great too!

  3. The National Center for Biotechnology Information (NCBI) is the definitive source for curated genomic DNA from all areas of life. It also has a dedicated SARS-CoV-2 portal for accessing all of its resources (sequencing data, publications etc.) on SARS-CoV-2.

5.2.0 Get out there and discover something cool!!!

The R programming language provides a great framework for data analysis. There are pre-existing packages that can facilitate your analysis of biological data and powerful tools for the statistical analysis and modeling of the growing datasets generated by the pandemic.

We've only had time to scrape the surface of what this language can do for you but you now also have a platform for practicing and growing your skills in this language!

I've included a large appendix covering some extra examples and finer details that we had to forego in this presentation where you can explore the syntax and language of R as well as another example of data cleanup and visualization with real-world COVID-19 data.

You can also find a code-complete version of this talk in HTML format here

CAGEF_services_slide.png

6.0.0 Appendix 1: Another example of data cleaning and visualization

6.1.0 What is the state of the current variants?

Let's switch gears and take a look at another dataset from Ontario. Rather than breaking down cases by public health unit, this tracks total cases across Ontario along with different categories such as hospitalizations, long-term care facilities, and some of the more recent variants.

Looking deeply at the variant information, it appears that Ontario begins tracking variant data on 2021-01-29. Let's build a dataset from that point onwards.

6.2.0 Rename your column names using rename_with()!

As you can see, all of the column names are awkward to work with. We should replace all of the white-space characters and dashes with the underscore character. The R interpreter hates white space... and this will make working with the column names much easier for us.

6.3.0 How do we track the rise of variants?

The data provided by the Ontario government represents a daily cumulative tally of the 3 variants of concern: B.1.1.7, B.1.351, and P.1. We want to convert these numbers into a daily incidence value. We'll be going through a large number of transformations to accomplish this.

At the same time we want to figure out how many new cases are being reported daily. We can use the diff() function on a vector to subtract neighbouring elements from each other, and we'll take advantage of that with the total_cases column.
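A sketch of what diff() does with a hypothetical running total:

```r
cumulative <- c(100, 112, 130, 131)   # a hypothetical running total by day
diff(cumulative)                      # 12 18 1 -- the daily new cases

# Note that diff() returns one fewer element than its input, so the first
# day's count needs to be handled separately (or padded).
length(diff(cumulative))              # 3
```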

The GH variant represents the dominant strain of SARS-CoV-2 in North America for most of 2020. From our calculation of daily new cases, we can also estimate the number of GH variant cases by subtracting the other variants reported on that day.

After generating those 4 values for each day, we'll convert the table to a long format so we can graph our data.

6.3.1 Plot our data separately for each variant

Since the variant case counts are all on very different scales, it's better if we facet by variant and look at the daily case numbers.

6.4.0 Smooth our data using a 7-day average

Looks like there is a lot of variation in the day-to-day reporting. This could be a result of the process of how/when samples are sent for variant testing. Maybe a 7-day window will help? Much like how we smoothed out our PHU data, taking the mean case number across a sliding window should reduce the noise.

We'll just recycle some of our code from earlier.

6.4.1 Plot our smoothed data with lines and bar graphs

We'll wrap up this example with the smoothed 7-day window data plotted as a line graph and we'll fill the empty space below the line graph with bars representing the 7-day average data as well.


7.0.0 Appendix 2: Improving your code readability

7.1.0 Making Life Easier

Let's discuss some important behaviours before we begin coding:

7.1.1 Annotate your code with #

Why bother?

Your worst collaborator is potentially you in 6 days or 6 months. Do you remember what you had for breakfast last Tuesday?

You can annotate your code for selfish reasons, or altruistic reasons, but annotate your code.

How do I start?

Comments may/should appear in three places:


# At the beginning of the script, describing the purpose of your script and what you are trying to solve

bedmasAnswer <- 5 + 4 * 6 - 0 # In line: describing a part of your code whose purpose is not obvious

Maintaining well-documented code is also good for mental health!


7.1.2 Naming conventions for files, objects, and functions

Basically, you have the following options:

The most important aspects of naming conventions are being concise and consistent!


7.1.3 Best Practices for Writing Scripts


7.2.0 Trouble-shooting basics

We all run into problems. We'll see a lot of mistakes happen in class too! That's OK if we can learn from our errors and quickly (or eventually) recover.

7.2.1 Common errors


7.2.2 Finding answers online

7.2.2.1 Asking a question

Remember: Everyone looks for help online ALL THE TIME. It is very common. Also, with programming there are multiple ways to come up with an answer, even different packages that let you do the same thing in different ways. You will work on refining these aspects of your code as you go along in this course and in your coding career.

Last but not least, to make life easier: Under the Help pane, there is a cheatsheet of Jupyter notebook keyboard shortcuts or a browser list here.


8.0.0 Appendix 3: More foundational basics of R

8.1.0 Lists are amorphous bundles strung together with code

Lists can hold mixed data types of different lengths. These are especially useful for bundling data of different types for passing around your scripts, to functions, or receiving output from functions! Rather than having to call multiple variables by name, you can store them in a single list!

If you forget what is in your list, use the str() function to check out its structure. It will tell you the number of items in your list and their data types.
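For example (the object and element names here are our own):

```r
# Bundle data of different types and lengths into one object
results <- list(
  genes     = c("ORF1ab", "S", "N"),   # character vector, length 3
  counts    = c(1024, 877, 1500, 42),  # numeric vector, length 4
  passed_qc = TRUE                     # a single logical value
)

str(results)     # 'List of 3', plus each element's type and length
length(results)  # 3
```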

8.1.0.1 Accessing elements from a list is accomplished in multiple ways

Accessing lists is much like opening up a box of boxes of chocolates. You never know what you're gonna get when you forget the structure!

You can access elements with a mixture of number and name annotations, much like data frames. The double-bracket notation [[x]] accesses the xth element of the list itself, while single brackets [x] return a sub-list.
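A quick sketch of the main options, using a made-up two-element list:

```r
results <- list(gene = "S", counts = c(10, 20, 30))

results$counts          # by name with $: the numeric vector itself
results[["counts"]]     # by name with [[ ]]: the same vector
results[[2]]            # by position with [[ ]]: the 2nd element
results[2]              # single [ ]: a one-element list, not the vector
results[["counts"]][1]  # the first value inside the counts vector: 10
```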


8.2.0 More facts about factors

8.2.1 Specify factors and their levels explicitly during or after data frame creation

You can specify which columns of strings are converted to factors at the time of declaring your column information. Alternatively you can coerce character vectors to factors after generating them.
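For instance (the column names are hypothetical; note that since R 4.0, data.frame() defaults to stringsAsFactors = FALSE):

```r
# Convert string columns to factors when declaring the data frame
df <- data.frame(region = c("east", "west", "east"),
                 cases  = c(10, 20, 30),
                 stringsAsFactors = TRUE)
is.factor(df$region)    # TRUE

# Or coerce a character vector to a factor after the fact
region_fct <- as.factor(c("east", "west", "east"))
is.factor(region_fct)   # TRUE
```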

R by default puts factor levels in alphabetical order. This can cause problems if we aren't aware of it. You can check the order of your factor levels with the levels() command. Furthermore you can specify, during factor creation, your level order.

Always check to make sure your factor levels are what you expect.

With factors, we can deal with our character levels directly, or their numeric equivalents.
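A minimal sketch with made-up severity data:

```r
# Default: levels are sorted alphabetically
severity <- factor(c("mild", "severe", "moderate", "mild"))
levels(severity)        # "mild" "moderate" "severe"

# Specify the level order explicitly during creation
severity <- factor(c("mild", "severe", "moderate", "mild"),
                   levels = c("mild", "moderate", "severe"))

as.character(severity)  # the character levels
as.integer(severity)    # their numeric equivalents: 1 3 2 1
```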


8.2.2 Even more facts about factors

  1. Use levels() to list the levels and their order for your factor
  1. To rename levels of a factor, declare and reassign your factor.
  1. Move a single level to the first position within your factor levels with relevel().
  1. Factor levels can be assigned an order of precedence during their creation with the parameter ordered = TRUE.
  1. Define labels for your factor during creation with the parameter labels = c(). Note that level order is assigned before labels are added to your data. You are essentially labeling the integers assigned to your factor levels, so be careful when using this parameter!
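Sketches of the last three points, with hypothetical treatment-group data:

```r
# Move one level to the first position
grp <- factor(c("ctrl", "trtA", "trtB"))
levels(relevel(grp, ref = "trtA"))   # "trtA" "ctrl" "trtB"

# Ordered factors allow comparisons between levels
dose <- factor(c("low", "high", "mid"),
               levels = c("low", "mid", "high"), ordered = TRUE)
dose[1] < dose[2]                    # TRUE: "low" < "high"

# labels rename levels *after* the level order is set
yn <- factor(c("n", "y", "y"), levels = c("n", "y"),
             labels = c("no", "yes"))
levels(yn)                           # "no" "yes"
```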

8.3.0 Mathematical operations on data frames and arrays

Yes, you can treat data frames and arrays like large lists where mathematical operations can be applied to individual elements or to entire columns or more!

8.3.1 Mathematical operations are applied differently depending on data type

Therefore be careful to specify your numeric data for mathematical operations.
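For example, with a toy data frame that mixes a character column and numeric columns:

```r
df <- data.frame(sample = c("a", "b", "c"),
                 reads  = c(100, 200, 300),
                 len_kb = c(10, 20, 30))

# Element-wise math works on the numeric columns
df$reads / df$len_kb              # 10 10 10

# df * 2 would fail because of the character column;
# select the numeric columns first
df[, c("reads", "len_kb")] * 2
```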


8.4.0 Using the apply() family of functions to perform actions across data structures

The above are illustrative examples to see how our different data structures behave. In reality, you will want to do calculations across rows and columns, and not on your entire matrix or data frame.

8.4.1 The apply() function will recognize basic functions and use them on vectorized data

For example, we might have a count table where rows are genes and columns are samples, and we want to know the sum of all the counts for a gene. To do this, we can use the apply() function. apply() takes an array or matrix (or something that can be coerced to one, like a numeric data frame) and applies a function over rows (MARGIN = 1) or columns (MARGIN = 2). Here we can invoke the sum function.
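A toy count table (the gene and sample names are invented) makes the two margins concrete:

```r
counts <- matrix(c(10, 20, 30,
                    5, 15, 25),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB"),
                                 c("s1", "s2", "s3")))

apply(counts, MARGIN = 1, FUN = sum)  # per-gene sums:   geneA 60, geneB 45
apply(counts, MARGIN = 2, FUN = sum)  # per-sample sums: 15 35 55
```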


8.4.2 The other members of the apply() family

There are 3 additional members of the apply() family that perform similar functions with varying outputs:

  1. lapply(data, FUN, ...) is usable on data frames, lists, and vectors. It returns a list as output.

    • It will coerce non-list objects to a list.
    • Additional arguments to FUN are passed through the ...

  2. sapply(data, FUN, ...) works like lapply() except that it tries to simplify the output to the most elementary data structure possible, i.e. it returns the simplest form of the data that makes sense as a representation.

  3. mapply(FUN, data, ...) is short for "multivariate" apply; it applies a function to multiple lists or multiple vector arguments in parallel.
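A short comparison using a made-up two-element list:

```r
vals <- list(a = 1:3, b = 4:6)

lapply(vals, mean)   # returns a list: $a = 2, $b = 5
sapply(vals, mean)   # simplifies to a named numeric vector: a = 2, b = 5

# mapply walks multiple vectors in parallel
mapply(function(x, y) x + y, 1:3, 4:6)   # 5 7 9
```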

9.0.0 Appendix 4: Instructions for installing your own software

9.1.0 Jupyter Notebooks and the R kernel

For this introductory course we will be teaching and running code for R through Jupyter notebooks. In this section we will discuss

  1. Installation of Jupyter (through Anaconda)
  2. Updating the default R package
  3. Starting up your Jupyter notebooks

9.1.1 Installing R and Jupyter Notebooks (via Anaconda3)

As of 2021-01-18, the latest version of Anaconda3 runs with Python 3.8.

Download the OS-appropriate version from here https://www.anaconda.com/products/individual

9.1.2 Updating the base version of R

As of 2020-12-11, the latest version of r-base available for Anaconda is 4.0.3, but Anaconda comes pre-installed with R 3.6.1. To save time, we will update just r-base through the command line using the Anaconda prompt. You'll need to find the menu shortcut to the prompt in order to run these commands. Before class, you should update all of your Anaconda packages; this will make sure you get the latest version of Jupyter notebook. Open up the Anaconda prompt and type the following command:

conda update --all

It will ask permission to continue at some point. Say 'yes' to this. After this is completed, use the following command:

conda install -c conda-forge/label/main r-base=4.0.3=hddad469_3

Anaconda will try to install a number of R-related packages. Say 'yes' to this.

9.1.3 Loading the R-kernel for your Jupyter notebook

Lastly, we want to connect your R version to the Jupyter notebook itself. Type the following command:

conda install -c r r-irkernel

Jupyter should now have R integrated into it. No need to build an extra environment to run it.

9.1.3.1 A quick note about Anaconda environments

You may find that for some reason or another, you'd like to maintain a specific R-environment (or other) to work in. Environments in Anaconda work like isolated sandbox versions of Anaconda within Anaconda. When you generate an environment for the first time, it will draw all of its packages and information from the base version of Anaconda - kind of like making a copy. You can also create these in the Anaconda prompt. You can even create new environments based on specific versions or installations of other programs. For instance, we could have tried to make an environment for R 4.0.3 with the command

conda create -n my_R_env -c conda-forge/label/main r-base=4.0.3=hddad469_3

This would create a new environment with version 4.0.3 of R but the base version of Anaconda would retain version 3.6.1 of R. A small but helpful detail if you are unsure about newer versions of packages that you'd like to use.

Likewise, you can update and install packages in new environments without affecting or altering your base environment! Again it's helpful if you're upgrading or installing new packages and programs. If you're not sure how it will affect what you already have in place, you can just install them straight into an environment.

For more information: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#cloning-an-environment

9.1.3.2 Using the Anaconda navigator to make a Jupyter notebook

If you are inclined, the Anaconda Navigator can help you make an R environment separate from the base, but you won't be able to perform the same fancy tricks as in the prompt, like installing new packages directly to a new environment.

Note: You should consider doing this only if you have a good reason to isolate what you're doing in R from the Anaconda base packages. You will also need to have installed r-base 4.0.3 to make a new environment with it through the Anaconda navigator.

The Anaconda navigator is a graphical interface that shows all of your pre-installed packages and gives you access to installing other common programs like RStudio (we'll get to that in a moment).

You will now have an R environment where you can install specific R packages that won't make their way into your Anaconda base.

You will likely find a shortcut to this environment in your (Windows) menu under the Anaconda folder. It will look something like Jupyter Notebook (R-4-0-3)

9.1.3.3 Installing packages for your personal Jupyter Notebook

Normally I suggest avoiding installing packages through your Jupyter Notebook. Instead, if you want to update your R packages for running Jupyter, it's best to add them through either the Anaconda prompt or Anaconda navigator. Again, using the prompt gives you more options but can seem a little more complicated.

One of the most useful packages to install for R is r-essentials. Open up the Anaconda prompt and use the command: conda install -c r r-essentials. After running, the Anaconda prompt will inform you of any package dependencies and it will identify which packages will be updated, newly installed, or removed (unlikely).

Anaconda has multiple channels (similar to repositories) that are maintained by different groups. These channels port regular R packages over to a format that can be installed in Anaconda and run by R. The two main channels you'll find useful for this are the r channel and the conda-forge channel. You can find more information about all of the packages on docs.anaconda.com. As you might have guessed, the basic format for installing packages is conda install -c channel-name r-package, where:

- conda install is the call to install packages. This can be done in a base or a custom environment.
- -c channel-name specifies the channel you wish to install from.
- r-package is the name of your package; most R package names begin with r-, e.g. r-ggplot2.


9.2.0 R and RStudio

9.2.1 Installing R

As of 2021-01-18, the latest stable R version is 4.0.3:

Windows:

- Go to <http://cran.utstat.utoronto.ca/>      
- Click on 'Download R for Windows'     
- Click on 'install R for the first time'     
- Click on 'Download R 4.0.3 for Windows' (or a newer version)     
- Double-click on the .exe file once it has downloaded and follow the instructions.

(Mac) OS X:

- Go to <http://cran.utstat.utoronto.ca/>      
- Click on 'Download R for (Mac) OS X'     
- Click on R-4.0.3.pkg (or a newer version)     
- Open the .pkg file once it has downloaded and follow the instructions.


Linux:

- Open a terminal (Ctrl + alt + t)
- sudo apt-get update     
- sudo apt-get install r-base     
- sudo apt-get install r-base-dev (so you can compile packages from source)


9.2.2 Installing RStudio

As of 2021-01-18, the latest RStudio version is 1.4.1103

Windows:

- Go to <https://www.rstudio.com/products/rstudio/download/#download>     
- Click on 'RStudio 1.3.1093 - Windows Vista/7/8/10' to download the installer (or a newer version)     
- Double-click on the .exe file once it has downloaded and follow the instructions.

(Mac) OS X:

- Go to <https://www.rstudio.com/products/rstudio/download/#download>     
- Click on 'RStudio 1.3.1093 - Mac OS X 10.13+ (64-bit)' to download the installer (or a newer version)     
- Double-click on the .dmg file once it has downloaded and follow the instructions.     


Linux:

- Go to <https://www.rstudio.com/products/rstudio/download/#download>     
- Click on the installer that describes your Linux distribution, e.g. 'RStudio 1.3.1093 - Ubuntu 18/Debian 10(64-bit)' (or a newer version)     
- Double-click on the .deb file once it has downloaded and follow the instructions.     
- If double-clicking on your .deb file did not open the software manager, open the terminal (Ctrl + alt + t) and type **sudo dpkg -i /path/to/installer/rstudio-xenial-1.3.959-amd64.deb**

 _Note: You have 3 things that could change in this last command._     
 1. This assumes you have just opened the terminal and are in your home directory. (If not, you have to modify your path. You can get to your home directory by typing cd ~.)     
 2. This assumes you have downloaded the .deb file to Downloads. (If you downloaded the file somewhere else, you have to change the path to the file, or download the .deb file to Downloads).      
 3. This assumes your file name for .deb is the same as above. (Put the name matching the .deb file you downloaded).

If you have a problem with installing R or RStudio, you can also try to solve the problem yourself by Googling any error messages you get. You can also try to get in touch with me or the course TAs.


9.2.3 Getting to know the RStudio environment

RStudio is an IDE (Integrated Development Environment) for R that provides a more user-friendly experience than using R in a terminal setting. It has 4 main areas or panes, which you can customize to some extent under Tools > Global Options > Pane Layout:

  1. Source - The code you are annotating and keeping in your script.
  2. Console - Where your code is executed.
  3. Environment - What global objects you have created and functions you have written/sourced.
    - History - A record of all the code you have executed in the console.
    - Connections - Which data sources you are connecting to. (Not being used in this course.)
  4. Files, Plots, Packages, Help, Viewer - self-explanatory(ish) if you click on their tabs.

All of the panes can be minimized or maximized using the large and small box outlines in the top right of each pane.

R_studio_default_layout.jpg

9.2.3.1 Source

The Source is where you keep the code and annotation that you want saved as your script. The tab at the top left of the pane shows your script name (i.e. 'Untitled.R'), and you can switch between scripts by toggling the tabs. You can save, search, or publish your source code using the buttons along the pane header. Code in the Source pane is not run or executed automatically; you have to send it to the Console.

To run your current line of code or a highlighted segment of code from the Source pane you can:
a) click the button 'Run' -> 'Run Selected Line(s)',
b) click 'Code' -> 'Run Selected Line(s)' from the menu bar,
c) use the keyboard shortcut CTRL + ENTER (Windows & Linux) Command + ENTER (Mac) (recommended),
d) copy and paste your code into the Console and hit Enter (not recommended).

There are always many ways to do things in R, but the fastest way will always be the option that keeps your hands on the keyboard.

9.2.3.2 Console

You can also type and execute your code (by hitting ENTER) in the Console when the > prompt is visible. If you enter code and you see a + instead of a prompt, R doesn't think you are finished entering code (i.e. you might be missing a bracket). If this isn't immediately fixable, you can hit Esc twice to get back to your prompt. Using the up and down arrow keys, you can find previous commands in the Console if you want to rerun code or fix an error resulting from a typo.

On the Console tab, in the top left of that pane, is your current working directory. Pressing the arrow next to your working directory will open your current folder in the Files pane. If you find your Console is getting too cluttered, selecting the broom icon in that pane will clear it for you. The Console also displays information about R on startup (such as the version number), progress messages during package installation, warnings, and errors.

9.2.3.3 Environment

In the Global Environment you can see all of the stored objects you have created or sourced (imported from another script). The Global Environment can become cluttered, so it also has a broom button to clear its workspace.

Objects are made by using the assignment operator <-. On the left side of the arrow, you have the name of your object. On the right side you have what you are assigning to that object. In this sense, you can think of an object as a container. The container holds the values given as well as information about 'class' and 'methods' (which we will come back to).

Type x <- c(2,4) in the Console followed by Enter. 1D objects' data types can be seen immediately as well as their first few values. Now type y <- data.frame(numbers = c(1,2,3), letters = c("a","b","c")) in the Console followed by Enter. You can immediately see the dimension of 2D objects, and you can check the structure of data frames and lists (more later) by clicking on the object's arrow. Clicking on the object name will open the object to view in a new tab. Custom functions created in session or sourced will also appear in this pane.

The Environment pane dropdown displays all of the currently loaded packages in addition to the Global Environment. Loaded means that all of the tools/functions in the package are available for use. R comes with a number of packages pre-loaded (e.g. base, grDevices).

In the History tab are all of the commands you have executed in the Console during your session. You can select a line of code and send it to the Source or Console.

The Connections tab is to connect to data sources such as Spark and will not be used in this lesson.

9.2.3.4 Files, Plots, Packages, Help, Viewer

The Files tab allows you to search through directories; you can go to or set your working directory by making the appropriate selection under the More (blue gear) drop-down menu. The ... to the top left of the pane allows you to search for a folder in a more traditional manner.

The Plots tab is where plots you make in a .R script will appear (notebooks and markdown plots will be shown in the Source pane). There is the option to Export and save these plots manually.

The Packages tab has all of the packages that are installed and their versions, and buttons to Install or Update packages. A check mark in the box next to a package means that the package is loaded. You can load a package by adding a check mark next to it; however, it is good practice to instead load the package in your script to aid reproducibility.

The Help menu has the documentation for all packages and functions. For each function you will find a description of what the function does, the arguments it takes, what the function does to the inputs (details), what it outputs, and an example. Some of the help documentation is difficult to read or less than comprehensive, in which case Googling the function is a good idea.

The Viewer will display vignettes, or local web content such as a Shiny app, interactive graphs, or a rendered html document.

9.2.3.5 Global Options

I suggest you take a look at Tools -> Global Options to customize your experience.

For example, under Code -> Editing I have selected Soft-wrap R source files followed by Apply so that my text will wrap by itself when I am typing and not create a long line of text.

You may also want to change the Appearance of your code. I like the RStudio theme: Modern and Editor font: Ubuntu Mono, but pick whatever you like! Again, you need to hit Apply to make changes.

That whirlwind tour isn't everything the IDE can do, but it is enough to get started.